Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: added packet count telemetry metrics #2797

Merged
merged 11 commits into from
Aug 26, 2024

Conversation

JoblersTune
Copy link
Collaborator

@JoblersTune JoblersTune commented Jul 9, 2024

Changes proposed in this pull request

  • Added packet counting metrics for telemetry v2
  • Updated the transactions_total metric to packet_amount_fulfill and included source metadata with the amount metric collection.
  • Moved the collection of all packet data over to the receiving node so that it can log a consistent match of incoming prepares to outgoing responses. This way, since telemetry is optional, we'll get a coherent reporting structure since one node can send coherent results in isolation.
  • Collecting packet counts and packet amount values on the sending node. The values are now collected from the handleIlpData function which is called only on sending nodes, and gives us access to whether packets are unfulfillable so we don't need to count quote packets.
  • Moved packet count and amount metrics out of middleware (since metrics should only be collected based on finalised values after all other middleware has run)
  • Cleaned up old telemetry code that is no longer needed after changes were made

Outside of this PR but related:

  • Changes to the telemetry dashboard to be reviewed as well. I've created a test dashboard called TEST packet count which is viewable on Grafana.

Context

Working towards expanding telemetry for telemetry v2

TODO:

  • add tests
  • Update docs
  • Consider transaction count being on sending node - planning to track on outgoing and incoming

Checklist

  • Related issues linked using fixes #number
  • Tests added/updated
  • Documentation added
  • Make sure that all checks pass
  • Bruno collection updated

@github-actions github-actions bot added pkg: backend Changes in the backend package. type: source Changes business logic labels Jul 9, 2024
Copy link

netlify bot commented Jul 9, 2024

Deploy Preview for brilliant-pasca-3e80ec ready!

Name Link
🔨 Latest commit 1c29f9d
🔍 Latest deploy log https://app.netlify.com/sites/brilliant-pasca-3e80ec/deploys/66c82350308bd900086e7dd7
😎 Deploy Preview https://deploy-preview-2797--brilliant-pasca-3e80ec.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

@JoblersTune JoblersTune added the type: telemetry flow ax0n label Jul 9, 2024
@github-actions github-actions bot added type: tests Testing related pkg: documentation Changes in the documentation package. type: documentation (archived) Improvements or additions to documentation labels Jul 15, 2024
@JoblersTune

This comment was marked as resolved.

@JoblersTune JoblersTune force-pushed the 2734/sj/telemetry-count-packets branch 3 times, most recently from 17f5214 to d4bb021 Compare July 25, 2024 13:30
@JoblersTune JoblersTune marked this pull request as ready for review July 29, 2024 14:09
@JoblersTune

This comment was marked as resolved.

@JoblersTune
Copy link
Collaborator Author

Please don't forget to look at the new dashboard additions as well please: https://rafikitelemetry.grafana.net/d/cdq2hn0w6frpcr/test-packet-count?orgId=1

@BlairCurrey
Copy link
Contributor

BlairCurrey commented Jul 30, 2024

Regarding the Packet count (by source) viz, should we either:

  • use the prepare amount for the total
    OR
  • sum the fulfill/reject amounts for the total

I think probably just the prepare amount? Currently we are summing fulfill/reject/prepare but the prepare include fulfill and reject already it looks like. So I think we can just remove the last one here (and related graphs):

image

@BlairCurrey
Copy link
Contributor

How will the existing visualizaitons be impacted? It looks like maybe the new Packet count (by source) replace the Transaction Count (by source) ones but then the Value Sent Through network line graphs will be updated to use the new (and Total Transactions Per Second would need the same)? I like the new stat visualizations but think the existing line graphs are still useful.

@JoblersTune JoblersTune self-assigned this Jul 31, 2024
@JoblersTune JoblersTune removed the request for review from sabineschaller July 31, 2024 07:12
@JoblersTune
Copy link
Collaborator Author

Regarding the Packet count (by source) viz, should we either:

  • use the prepare amount for the total
    OR
  • sum the fulfill/reject amounts for the total

I think probably just the prepare amount? Currently we are summing fulfill/reject/prepare but the prepare include fulfill and reject already it looks like. So I think we can just remove the last one here (and related graphs):

I don't have strong feelings about whichever approach.

I know Max was initially hoping we'd collect this to get a handle on packet loss through the network but this tricky when telemetry is optional.

Given that I assumed we would still be most interested in 1. the total actual number of packets moving through the network (prepare + reject + fulfill) and 2. the relationship between them (we expect prepare = fulfill + reject) and maybe good to see how many rejects we get vs fulfills as well. Of course some of the rejects are quote based as well.

I guess I'm not exactly sure.

Maybe we can just graph fulfill and reject and leave the prepare and totals out?

@JoblersTune
Copy link
Collaborator Author

How will the existing visualizaitons be impacted? It looks like maybe the new Packet count (by source) replace the Transaction Count (by source) ones but then the Value Sent Through network line graphs will be updated to use the new (and Total Transactions Per Second would need the same)? I like the new stat visualizations but think the existing line graphs are still useful.

Good question. This felt obvious when I started but less so now.
I think we need a call about this.
I'm gonna formulate some ideas and then let's rather discuss.

@JoblersTune JoblersTune force-pushed the 2734/sj/telemetry-count-packets branch from 1803725 to 6e29243 Compare August 15, 2024 11:08
BlairCurrey
BlairCurrey previously approved these changes Aug 15, 2024

While `handleIlpData` is an effective location for telemetry collection, it has some limitations:

Sender-Side Limitation: Currently, metrics are only collected on the sender side. This is adequate for now but does not capture data from connecting nodes, which we plan to address in future implementations when multi-hop support is added.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be good to add a short explanation for the connector nodes: eg in A > B > C, we only collect packet information in A

- packet_count_reject: Counts the reject packets received.
- packet_amount_fulfill: Records the amount sent in fulfill packets.

These metrics provide valuable insights, including potential packet loss.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking about this, and this is not exactly packet loss like it is defined in the networking sense (since we only count them in the originating connector) , but something like rate of success/error instead


### Challenges with the Current Setup

While `handleIlpData` is an effective location for telemetry collection, it has some limitations:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer not having function names in documentation since it becomes outdated quite quickly, I think we can just remove references to it and keep just the general explanation which you have

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason I've included function names is try and avoid any duplicated work when the next person needs to update this or change the metric location. Is there somewhere better for me to use function names and get really specific outside of the README?

Copy link
Collaborator

@koekiebox koekiebox left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have a couple of comments.

@mkurapov mkurapov self-requested a review August 19, 2024 16:24
Copy link
Contributor

@mkurapov mkurapov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good, minor comments (main one about some tests)

@@ -1161,6 +1161,85 @@ describe('OutgoingPaymentService', (): void => {
return payment
}

test('Telemetry Transaction Counter increments for COMPLETED transactions', async (): Promise<void> => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for adding this

import { IlpFulfillFactory, IlpRejectFactory } from '../factories'
import { ValueType } from '@opentelemetry/api'

describe('Connector Core Telemetry', () => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should add a few tests for handling non-positive cases (e.g., "... doesn't collect metrics for unfulfillable packets")

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So should I just pass in unfulfillable as true and check it doesn't call incrementCounter?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did my best, have a look and see if the new tests meet this criteria :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, looks good, thanks!

@mkurapov
Copy link
Contributor

@JoblersTune just need to fix the build and good to go

@JoblersTune
Copy link
Collaborator Author

@melissahenderson

This PR has been waiting a while, so I nearly forgot there was a doc change here too.

I'll just paste the text here for your reference and remove the file changes in this PR (telemetry overview page)

---
title: Overview
---

## Purpose

The objective of the telemetry feature is to gather metrics and establish an infrastructure for visualizing valuable network insights. The metrics we at the Interledger Foundation collect include:

- The total amount of money transferred via packet data within a specified time frame (daily, weekly, monthly).
- The number of transactions that have been at least partially successful.
- The number of ILP packets flowing through the network.
- The average amount of money held within the network per transaction.
- The average time it takes for an outgoing payment to complete

We aim to track the growth of the network in terms of transaction sizes and the number of transactions processed. Our goal is to use these data for our own insights and to enable [Account Servicing Entities](/reference/glossary#account-servicing-entity) (ASEs) to gain their own insights.

## Privacy and Optionality

Privacy is a paramount concern for us. Rafiki's telemetry feature is designed to provide valuable network insights without violating privacy or aiding malicious ASEs. For more information, please [read the privacy docs](/telemetry/privacy).

The telemetry functionality is currently enabled by default on test (non-livenet) environments, i.e. any environment that is not dealing with real money. When active, it transmits metrics to the "testnet" collector. In the future, those ASEs operating in a production "livenet" environment (real money) will be able to opt-in to sharing their metrics with a "livenet" collector. Regardless of environment, Account Servicing Entities (ASEs) can also opt-out of telemetry completely.

## Architecture

The architecture of the telemetry feature is illustrated below:

![Telemetry architecture](/img/telemetry-architecture.png)

## OpenTelemetry

We have adopted [OpenTelemetry](https://opentelemetry.io/) to ensure compliance with a standardized framework that is compatible with a variety of tool suites. This allows clients to use their preferred tools for data analysis, while Rafiki is instrumented and observable through a standardized metrics format.

## Telemetry ECS Cluster

The Telemetry Replica service is hosted on AWS ECS Fargate and is configured for availability and load balancing of custom ADOT (AWS Distro for Opentelemetry) Collector ECS tasks.

When ASEs opt for telemetry, metrics are sent to our Telemetry Service. To enable ASEs to build their own telemetry solutions, instrumented Rafiki can send data to multiple endpoints. This allows the integration of a local [Otel collector](https://opentelemetry.io/docs/collector/) container that can support custom requirements. Metrics communication is facilitated through [gRPC](https://grpc.io/).

## Otel SDK - Rafiki Instrumentation

The Opentelemetry SDK is integrated into Rafiki to create, collect, and export metrics. The SDK integrates seamlessly with the OTEL Collector.

## Prometheus - AMP

We use Amazon's managed service for Prometheus (AMP) to collect data from the Telemetry cluster.

**Note**: AMP offers limited configuration options and cannot crawl data outside of AWS. This limitation led us to adopt a push model, using prometheusRemoteWrite, instead of a pull model. For future development, we may consider hosting our own Prometheus.

## Grafana - Grafana Cloud

Grafana Cloud is used for data visualization dashboards. It offers multiple tools that extend Prometheus Promql.

**Note**: We initially used Amazon hosted Grafana, but it did not meet our needs for embedding dashboards. Grafana Cloud offers a feature called “Public dashboards”, which allows us to share dashboards. However, embedding may still pose a challenge.

## Exchange Rates

For telemetry purposes, all amounts collected by instrumented Rafiki should be converted to a base currency.

**Privacy Reasoning**: If only two ASEs are peered over a non-USD currency and we collect data in that currency, it would be easy to determine the volumes moved between those two ASEs. To maintain privacy, we convert all amounts to a base currency.

If an ASE does not provide the necessary exchange rate for a transaction, the telemetry solution will still convert the amount to the base currency using external exchange rates. A Lambda function on AWS retrieves and stores these external exchange rates. It is triggered by a daily CloudWatch event and stores the rates in a public S3 Bucket. The S3 Bucket does not have versioning, and the data is overwritten daily to further ensure privacy.

## Instrumentation

Data points for all metrics (e.g. counter increases) are exported to collection endpoints at a configurable interval (default recommended to 15s).

Currently collected metrics:

- `transactions_total` - Counter metric
  - Description: “Count of funded outgoing transactions”
  - This counter metric increases by 1 for each successfully funded outgoing payment resource.
- `packet_count_prepare` - Counter metric
  - Description: “Count of prepare packets that are sent”
  - This counter metric increases by 1 for each prepare packet that is sent.
- `packet_count_fulfill` - Counter metric
  - Description: “Count of fulfill packets”
  - This counter metric increases by 1 for each fulfill packet that is received.
- `packet_count_reject` - Counter metric
  - Description: “Count of reject packets”
  - This counter metric increases by 1 for each reject packet that is received.
- `packet_amount_fulfill` - Counter metric
  - Description: “Amount sent through the network”
  - This amount metric increases by the amount sent in each ILP packet.
- `transaction_fee_amounts` - Counter metric
  - Description: “Fee amount sent through the network”.
  - This fee amount metric increases by the (amount sent minus amount received) for an outgoing payment.
- `ilp_pay_time_ms` - Histogram metric
  - Description: “Time to complete an outgoing ILP payment”
  - This histogram metric records the time taken to make an ILP payment.

**Note**: The current implementation only collects metrics on the SENDING side of a transaction. Metrics for external open-payments transactions RECEIVED by a Rafiki instance in the network are not collected.

@github-actions github-actions bot removed pkg: documentation Changes in the documentation package. type: documentation (archived) Improvements or additions to documentation labels Aug 23, 2024
@melissahenderson
Copy link
Contributor

@JoblersTune Thank you! I'll get the updates worked in to our other branch.

Copy link
Collaborator

@koekiebox koekiebox left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚀

@JoblersTune JoblersTune merged commit 34afdba into main Aug 26, 2024
42 checks passed
@JoblersTune JoblersTune deleted the 2734/sj/telemetry-count-packets branch August 26, 2024 07:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
pkg: backend Changes in the backend package. type: source Changes business logic type: telemetry flow ax0n type: tests Testing related
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Track count of prepare, fulfill, and reject packets
5 participants